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Abstract 

Efficient unsupervised training and inference in 
deep generative models remains a challenging 
problem. One basic approach, called Helmholtz 
machine or Variational Autoencoder, involves 
training a top-down directed generative model 
together with a bottom-up auxiliary model used 
for approximate inference. Recent results in¬ 
dicate that better generative models can be ob¬ 
tained with better approximate inference proce¬ 
dures. Instead of improving the inference proce¬ 
dure, we here propose a new model, the bidirec¬ 
tional Helmholtz machine, which guarantees that 
the top-down and bottom-up distributions can ef¬ 
ficiently invert each other. We achieve this by in¬ 
terpreting both the top-down and the bottom-up 
directed models as approximate inference distri¬ 
butions and by defining the model distribution to 
be the geometric mean of these two. We present a 
lower-bound for the likelihood of this model and 
we show that optimizing this bound regularizes 
the model so that the Bhattacharyya distance be¬ 
tween the bottom-up and top-down approximate 
distributions is minimized. This approach results 
in state of the art generative models which prefer 
significantly deeper architectures while it allows 
for orders of magnitude more efficient likelihood 
estimation. 


1. Introduction and background 

Training good generative models and fitting them to com¬ 
plex and high dimensional training data with probability 
mass in multiple disjunct locations remains a major chal¬ 
lenge. This is especially true for models with multiple lay¬ 
ers of deterministic or stochastic variables, which is un- 
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fortunate because it has been argued previously (Hinton 
et al., 2006; Bengio, 2009) that deeper generative models 
have the potential to capture higher-level abstractions and 
thus generalize better. Although there has been progress 
in dealing with continous-valued latent variables (Kingma 
& Welling, 2014), building a hierarchy of representations, 
especially with discrete-valued latent variables, remains a 
challenge. 

With the Helmholtz machine (Hinton et al., 1995; Dayan 
et al., 1995), a concept was introduced that proposed to not 
only fit a powerful but intractable generative model p(x, h) 
to the training data, but also to jointly train a parametric 
approximate inference model g(h|x). The distribution q is 
used to perform approximate inference over the latent vari¬ 
ables h of the generative model given an observed example 
x, i.e. to approximate p(h|x). This basic idea has been 
applied and enhanced many times; initially with the wake- 
sleep algorithm (WS, Hinton et al. (1995); Dayan & Hinton 
(1996)) and more recently with the variational autoencoder 
(VAE, Kingma & Welling (2014)), stochastic backpropa- 
gation and approximate inference in deep generative mod¬ 
els (Rezende et al., 2014), neural variational inference and 
learning (NVIL, Mnih & Gregor (2014)) and reweighted 
wake-sleep (RWS, Bornschein & Bengio (2015)). 

Recent results indicate that significant improvements can 
be made when better approximate inference methods are 
used: Salimans et al. (2014) for example presented an iter¬ 
ative inference procedure that improves the samples from 
q by employing a learned MCMC transition operator. In 
(Hjelm et al., 2015) the authors propose an iterative, gra¬ 
dient based procedure to refine the approximate posterior. 
Burda et al. (2015) present the importance weighted auto 
encoder (IWAE), an improved VAE that, similarly to RWS, 
uses multiple samples from q to calculate gradients. And 
RWS already reported that autoregressive q distributions 
lead to noticable improvements. In contrast to these previ¬ 
ous approaches that aimed at incorporating more powerful 
inference methods to gain better approximations of p(h|x), 
we here propose to regularize the top-down model p such 
that its posterior stays close to the approximate inference 
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distribution q (and vice versa). We archive this by interpret¬ 
ing both p and q as approximate inference models for our 
actual generative model p*, which is defined to be the geo¬ 
metric mean over the top-down and bottom-up approximate 
inference models, i.e., p*(x, h) = 1 /zy / p(x, h)g(x, h). 

In Section 2 we show that this model definition leads to 
an objective that can be interpreted as using a regulariza¬ 
tion term that encurages solutions where p and q are close 
to each other in terms of the Bhattacharyya distance. In 
Section 3 we will explain how to perform importance sam¬ 
pling based training and inference. The ability to model 
complex distributions and the computational efficiency of 
this approach are demonstrated empirically in Section 4. 

2. Model definition and properties 

We introduce the bidirectional Helmholtz Machine (BiHM) 
by defining a joint probability distribution over three vari¬ 
able vectors, an observed vector x and two latent variable 
vectors hi and h 2 . Analogous to a Deep Boltzmann Ma¬ 
chine (DBM, Salakhutdinov & Hinton (2009)), we think of 
these as layers in a neural network with links between x 
and hi on the one side, and hi and h 2 on the other side. 
We will present our approach for the specific case of an 
architecture with two hidden layers, but it can be applied 
to arbitrary graphs of variables without loops. It can es¬ 
pecially be used to train architectures with more than two 
stacked layers of latent variables. 

Let p*(x, hi,h 2 ) be a joint probability distribution con¬ 
structed in a specific way from two constituent distributions 
p(x,hi,h 2 ) and q(x,hi,h 2 ), 

P *( x , hi, h 2 ) = 2 x /p( x ,hi,h 2 ) </(x, hi, h 2 ) , 

where Z is a normalization constant and p and q are di¬ 
rected graphical models from h 2 to x and vice versa, 

p(x,h l5 h 2 ) =p(h 2 ) p(hi|h 2 ) p(x|hi) and 
<?(x,hi,h 2 ) = q(x) <?(hi|x) q(h 2 |hi) . 

We assume that the prior distribution p( h 2 ) and all condi- 
tional distributions belong to parametrized families of dis¬ 
tributions which can be evaluated and sampled from effi¬ 
ciently. For q(x) we do not assume an explicit form but 
define it to be the marginal 

q(x) = p* (x) = P*(x,hi,h 2 ) 

hi,h 2 

= ZZZl ^2 \J p(x, hi,h 2 ) <?(hi|x)g(h 2 |hi) 

hi,h 2 

= Z \/p( x > hi,h 2 ) g(hi|x)g(h 2 |hi)) . 

hi,h 2 


The normalization constant Z guarantees that 
J] x h h P*( x , hi,h 2 ) = 1. Using the Cauchy-Schwarz 

inequality | E y f(y)g(v)\ 2 < T, v \f(v )\ 2 x T, y \9{v)\ 2 
and identifying a/p(x, hi, h 2 ) with f(y) and 
\Jq{x, hj, h 2 ) with g(y), it becomes clear that 

Z = Ex.hi.ha V P( x > hi, h 2 )</(x, hi, h 2 ) < 1 for 

arbitrary p and q. Furthermore, we see that Z = 1 if and 
only if p(x, hi,h 2 ) = q(x, hi,h 2 ). We can therefore 
obtain a lower bound on the marginal probability p* (x) by 
defining 

p*(x) = ( Z \/p(x> hi,h 2 )q(hi|x) q(h 2 |hi)) 

hi,h 2 

= Z 2 p*(x) < p*(x) . (1) 

This suggests that the model distribution p* (x) can be fitted 
to some training data by maximizing the bound of the log- 
likelihood (LL) logp*(x) instead of logp*(x), as we elab¬ 
orate in the following section. Since logp*(x) can reach 
the maximum only when Z —» 1, the model is implicitly 
pressured to find a maximum likelihood solution that yields 
P(x,hi,h 2 ) « q(x,hi,h 2 ) « p*(x, hi, h 2 ). 

2.1. Alternative view based on the Bhattacharyya 
distance 

Recalling the Bhattacharyya distance D B (p,q) = 
- log Y,y Vp(y)Q(y) ( for which holds D B (p,q) > 0 for 
arbitrary distributions p, q and D B (p, q) = 0 only if p = q) 
the model LL logp*(x) can be decomposed into 

logp*(x) = 2 log ^2 \/p(x, hi,h 2 )q(hi|x)q(h 2 |hi) 

hi,h 2 

- 2 log ^2 h i> K)ZZ, hi, h' 2 ) 

x', h m 

= log p* (x) + 2 D b (p, q) > log p* (x) , (2) 

where we clearly see that the proposed training objective 
log p* (x) corresponds to the LL log p* (x) minus 2 times 
the Bhattacharyya distance D B (p , q), i.e., it is maximzing 
the true LL and minimizing the distance between p and q. 
We can compare this to the variational approach, where the 
marginal probability logp(x) of some model containing la¬ 
tent variables h is rewritten in terms of the KL-divergence 
D KL (q(h\x) ||p(h|x)) = tf( h l x ) log > 0 to 

obtain a lower bound 

logp(x) = E [logp(x, h) - log q(h|x)] 

h~g(h|x) 

+ D KL (q(h\x) ||p(h|x)) 

> E [logp(x,h)-logq(h|x)] . (3) 

h~g(h|x) 

















Analogous to variational methods that maximize the lower 
bound (3), we can thus maximize log p*(x), and it will 
tighten the bound as Dsip^q) approaches zero. While 
this seems very similar to the variational lower bound, we 
should highlight that there are some important conceptual 
differences: 1) The KL-divergence in variational methods 
measures the distance between distributions given some 
training data. The Bhattacharyya distance here in contrast 
quantifies a property of the model p*(x, hi, h 2 ) indepen¬ 
dently of any training data. In fact, we saw that D B {p , q) = 
— log Z. 2) The variational lower bound is typically used 
to construct approximate inference algorithms. We here use 
our bound p* (x) just to remove the normalization constant 
Z from our target distribution p* (x) . Even after applying 
the lower-bound, we still have to tackle the inference prob¬ 
lem which manifests itself in form of the full combinatorial 
sum over hi and h 2 in equation (1). Although it seems in¬ 
tuitively reasonable to use a variational approximation on 
top of the bound p* (x) we will here not follow this direc¬ 
tion but rather use importance sampling to perform approx¬ 
imate inference and learning (see section 3). Combining a 
variational method with the bound p* (x) is therefore sub¬ 
ject to future work. 

We can also argue that optimizing logp*(x) instead of 
logp*(x) is beneficial in the light of the original goal 
we formulated in section 1: To learn a generative model 
p* that is regularized to be close to the model q which 
we use to perform approximate inference for p*. Let 
us assume we have two equally well trained models pj^ 
and p* e2 , i.e., in expectation over the empirical distri¬ 
bution E [logpJ i (x)] = E [logp 0 2 (x)j, but the ex¬ 
pected bound p*(x) for the first model is closer to 
the LL than the expected bound for the second model: 
E [log^px)] > E [logp 0 2 (x)]. Using equation (2) we 
see that D B {po 1 ,qo 1 ) < D B {p0 2 ,qe 2 ) which indicates 
that qo i is closer to p^ than qe 2 is to p ^ 2 (when we measure 
their distance using the Bhattacharyya distance). Accord¬ 
ing to our original goal, we thus prefer solution p* Qi , where 
the bound p*(x) is maximized and the distance Db(p , q) 
minimized. 

Note that the decomposition (2) also emphasizes why our 
recursive definition q(x) = h2 p*(x, hi, h 2 ) is a 

consistent and reasonable one: minimizing D B (p, q) dur¬ 
ing learning means that the joint distributions p(x, hi, h 2 ) 
and #(x, hi,h 2 ) approach each other. This implies that 
the marginals p(h/) and q(hi) for all layers l become 
more similar. This also implies p(x) « q(x) in the 
limit of Dsip^q) 0 ; a requirement that most simple 
parametrized distributions g(x) could never fulfill. 


Algorithm 1 Training p* (x) using K importance samples 
for number of training iterations do 

• Sample x from the training distribution (i.e. x ~ V) 
for k = 1, 2,..., K do 

• Sample ~ g(h^|x); ~ g(h[^|h^) 

(for layers l = 2 to L) 

• Compute g(h^|x) andp(x, h^) 

(forh(*> = (hP,...,hP)) 

end for 

• Compute unnormalized importance weights 

id k = ^/p(x,h (fc ))/qr(h( fe )|x) 

• Normalize the weights = UJk /j2 k ' 

• Update parameters of p and q: gradient descent 

with gradient estimator 
Efc u k §9 l°gp(x, h( fc ))g(hW |x) 

end for 


section we can define a wide range of possible models. Fur¬ 
thermore, we have a wide range of potential training and 
appropriate inference methods we could employ to maxi¬ 
mize log p* (x). 

In this text we concentrate on binary latent and observed 
variables x,hi,h 2 and model all our conditional distri¬ 
butions by simple sigmoid belief network layers, e.g., 
p(x|hi) = Y[ i B(x i \(j(Wi hi + bi)) where B{xi\c) 
refers to the Bernoulli distribution with P(xi #= 1) = c, Wi 
are the connection weights between the latent variables 
hi and the visible variable xp bi is the bias of and 
cr(-) is the sigmoid function. For our top-level prior 
p(h 2 ), we use a factorized Bernoulli distribution: p(h 2 ) = 
Ili I cr(b 2 ,i)). 

We form an estimate of p* (x) by using importance sam¬ 
pling instead of the exhaustive sum over hi and h 2 in equa¬ 
tion (1). We use q(hi |x)g(h 2 |hi) as the proposal distribu¬ 
tion which is by construction easy to evaluate and to sample 
from: 


5*( x ) = ( E V / P(x,hi,h 2 ) <?(hi|x) </(h 2 |h 

-( 


hi,h 2 


E 

. h 2 ~g(h 2 |hi) 
hi~g(hi|x) 


/ p(x,hi,h 2 ) 
9 (hi|x) </(h 2 |hi) 


)' 


1 K 


p(x, hP,hP) 


P \ 9(hf } |x) q(h { 2 k) 


h ( fe V 


(4) 


3. Inference and training 

Based on the construction of p* (x) outlined in the previous 


with h^ ~ g(hi|x) and h^ ~ g(h 2 |h^). Using the 
same approach, we can also derive the well known esti¬ 
mator for the marginal probability of a datapoint under the 
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Figure 1. MNIST experiments: A) Random samples from top-down model p(x). B) Generally improved samples after running 10 
iterations of Gibbs sampling to obtain approximate samples from the joint model p*(x). In A and B we show expected samples instead 
of sampling from the bottom Bernoulli distribution, as usual. C) Sensitivity of the test set LL estimates to the number of samples K. 
We plot the test set logp(x) and logp* (x) estimates of our best BiHM model together with the logp(x) estimates of our best RWS and 
YAE models. 
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Figure 2. Inpainting of binarized MNIST digits. The left column in each block shows the original digit randomly sampled from the 
MNIST test set; the first column shows the masked version presented to the algorithm, and the next three columns show different, 
independent reconstructions after 100 Gibbs iterations, (see section 3). All images were selected randomly. 


top-down generative model p: 


p(x) = e 

ha^halh!) |_7(hi|x) g(h 2 |hi) 
hi~<?(hi|x) 

_ J_ A p(x,h^ } ,h^ fc) ) 

lx) q(hP\hP) 


(5) 


Comparing (4) and (5) and making use of Jensen’s inequal¬ 
ity it becomes clear that p(x) > p* (x). 

Analogous to the parameter updates in RWS (Bornschein 
& Bengio, 2015), we can derive an importance sampling 
based estimate for the LL gradient with respect to the pa¬ 
rameters of p and q (jointly denoted by 0) and use it to op¬ 
timize our objective (we use h to jointly denote the latent 
variables of all layers) 

A log p* (x) = ^ log ( 53 vlA-hM h l x )) 

h 

= 2 Eh Vp(*i h M h l x ) m lQ g Vpj*-’ h M h l x ) 

Eh' \M X > h'M h 'l x ) 

K „ 

- ^2^00 lo S-P( x , h(&) )q , ( h(fe) I x ) , (6) 

k=1 

with rsj q(h | x) and importance weights 


In contrast to VAEs and IWAEs, the updates do not require 
any form of backpropagation through layers because, as far 
as the gradient computation log >/p(x, h( fe ))#(h( fe ) |x) 
is concerned, these samples are considered fully observed. 
The gradient approximation (6) computes the weighted av¬ 
erage over the individual gradients. These properties are 
basically inherited from the RWS training algorithm. But 
in contrast to RWS, and in contrast to most other algorithms 
which employ a generative model p and an approximate in¬ 
ference model q , we here automatically obtain parameter 
updates for both p and q because we optimize p* which 
contains both. The resulting training method is summa¬ 
rized in algorithm 1 . 


Estimating the partition function Z To compute 
p* (x) = ^2 p* (x) and to monitor the training progress it 
is desirable to estimate the normalization constant Z. In 
stark contrast to undirected models like RBMs or DBMs, 
we can here derive an unbiased importance sampling based 
estimator for Z 2 : 


Z 2 = E 


q( h|x) 


E 


,h~p(x,h) V £>(x, h) h'~g(h|x) 


/P(h ; ,x) 

< /(h , |x) 


= E 

x,h~p(x,h) 
h / ~g(h / |x) 


/ p(x, h ; )(j(h|x) 
P(x, h)<?(h'|x) 


(7) 


Wfc 


E k' 


where 


Uk 


p(x, h^)) 
g(h( fc )|x) 


We denote the number of samples used to approximate the 
outer expectation and the inner expectation with K outer and 
dinner respectively. In the experimental section, we show 
that we obtain high quality estimates for Z 2 with K mnQr =\ 


for k = 1,..., K. 




































































































































and a relatively small number of samples K 0uter . By tak¬ 
ing the logarithm, we obtain a biased estimator for 2 log Z, 
which will, unfortunately, underestimate 2 log Z on aver¬ 
age due to the concavity of the logarithm and the vari¬ 
ance of the Z 2 estimate. This can lead to overestimates 
for logp*(x) (see equation (4)) if we are not careful. For¬ 
tunately, the bias on the estimated log Z is induced only 
by the concavity of the logarithm; the underlying estimator 
for Z 2 is unbiased. We can thus effectively minimize the 
bias by minimizing the variance of the Z 2 estimate (e.g. 
by taking more samples). This is a much better situation 
than for Z-estimating methods that rely on Markov chains 
in high dimensional spaces, which might miss entire modes 
because of mixing issues. 

Sampling and inpainting We now discuss two general 
approaches for approximate sampling from a BiHM. One 
can either easily and efficiently sample from the directed 
model p , or one can use Gibbs sampling to draw higher- 
quality samples from the undirected model p*. For the lat¬ 
ter, importance resampling is used to approximately draw 
samples from the conditional distributions, e.g. from: 

p*(hp|x,h 2 ) = 

V /, p(h! fe) |h2)p(x|hf ) )g(hf ) |x)g(h 2 |hP) 

Eh! V P( h i|h 2 )p(x|hi)</(hi|x)</(h 2 |hi) 

Here we choose to draw the proposal samples from the 
mixture distribution 1 / 2 p(hi |h 2 ) + V 2 (?(hi l x )> which en¬ 
sures that we have a symmetric chance of covering the 
high probability configurations of p* (hi |x, I 12 ) induced by 
p and q. We resample a final sample from p*(hi|x, h 2 ) 
propotionally to their importance weights which are thus 
given by 

(fe) _ /P( h i fc) |h 2 )p(x|hP)g(hP|x)g(h 2 |h~P) 

p(hP|h 2 ) + q(hP|x) 

where is randomly drawn from p(hi|h 2 ) or g(hi|x) 

(k) 

and the constant C collects all terms not containing h) ; 
and can be ignored. For p*(x|hi) we choose to sample by 
drawing the proposal samples fromp(x|hi). We iteratively 
update all odd layers followed by all even layers until we 
consider the chain to be in equilibrium. 

Equipped with approximate sampling procedures for the 
conditional distributions, it is straightforward to construct 
an algorithm for inpainting: Given a corrupted input dat- 
apoint x, we first initialize a Markov chain by drawing 
hi,h 2 ^ g(hi,h 2 |x) and then run the Gibbs sampling 
procedure. Whenever we sample the bottom layer x ^ 
p*(x|hi) , we keep the non-corrupted elements of x fixed. 


4. Experimental results 

In this section we present experimental results obtained 
when applying the algorithm to various binary datasets. 
Our main goal is to ensure that the theoretical proper¬ 
ties discussed in section 2 translate into a robust algo¬ 
rithm that yields competitive results even when used with 
simple sigmoid belief network layers as conditional dis¬ 
tributions. We train all models using Adam (Kingma & 
Ba, 2014) with a mini-batch size of 100. We initial¬ 
ize the weights according to Glorot & Bengio (2010), 
set the biases to -1, and use L\ regularization A=10 -3 
on all the weights. Our implementation is available at 
https://github.com/jbornschein/bihm. 

UCI binary datasets To ascertain that importance sam¬ 
pling based training of BiHMs works in general, we applied 
it to the 8 binary datasets from the UCI dataset repository 
that were evaluated e.g. in (Larochelle & Murray, 2011). 
We use a learning rate of 10“ 2 or 10 -3 for all the experi¬ 
ments. The architectures, layer sizes and final LL estimates 
can be found in tables 3 and 4. 


Dataset 

BiHM layer sizes 

RWS layer sizes 

ADULT 

100, 70, 50, 25 

100, 20, 5 

CONNECT4 

300, 110, 50, 20 

150, 50, 10 

DNA 

200,150,130,100,70,50,30,20,10 

150, 10 

MUSHROOM 

150, 100, 90, 60, 40, 20 

150, 50, 10 

NIPS-0-12 

200, 100, 50, 25 

150, 50, 10 

OCR 

600, 500, 100, 50, 30, 10 

300, 100, 10 

RCV1 

500, 120, 70, 30 

200, 50, 10 

WEB 

650, 580, 70, 30, 10 

300, 50, 10 


Table 2. Architectures for our best UCI BiHM models compared 
to our best RWS models. We observe that BiHMs prefer signifi¬ 
cantly deeper architectures than RWS. 

Binarized MNIST We use the MNIST dataset that was 
binarized according to (Murray & Salakhutdinov, 2009) 
and which we downloaded in binarized form (Larochelle, 
2011). Compared to RWS, we again observe that BiHMs 
prefer significantly deeper and narrower models. Our 
best model consists of 1 visible and 12 latent layers 
with 300,200,100,75,50,35,30,25,20,15,10,10 latent vari¬ 
ables. We follow the same experimental procedure as in the 
RWS paper: First train the model with K -10 samples and a 
learning rate of 10 -3 until convergence and then fine-tune 
the parameters with K -100 samples and a learning rate of 
3 x 10 -4 . All layers are actually used to model the empiri¬ 
cal distribution; we confirmed that training shallower mod¬ 
els (obtained by leaving out individual layers) decreases 
the performance. We obtain test set log-loglikelihoods of 
logp*(x) ~ -84.6 ± 0.23 and logp(x) ~ -84.3 =b 0.22 
*. The next section presents a more detailed analysis of 

Recently, a version of MNIST where binary observed vari¬ 
ables are resampled during training has been used (e.g., Burda 
et al. (2015)). On this dataset we obtain logp(x) ~ -82.9 















Figure 3. log Z 2 estimates for different values of if inner as a function of A) the number of samples if outer, B) the total number of samples 
if inner • if outer for the BiHM trained on MNIST; the gray region shows the mean and the standard deviation for 10 runs with K - mncr -\. 
This shows that, from the point of view of total computation, convergence is fastest with ifmner=l; and that we obtain a high quality 
estimate of the partition function with only a few million samples. C) Evolution of the estimates of logp(x), logp*(x), and 2 log Z 
during training on MNIST. 


Model 

ADULT 

CONNECT4 

DNA 

MUSHROOMS 

NIPS-0-12 

OCR-LETTERS 

RCVl 

WEB 

auto regressive 









NADE 

13.19 

11.99 

84.81 

9.81 

273.08 

27.22 

46.66 

28.39 

EoNADE 

13.19 

12.58 

82.31 

9.68 

272.38 

27.31 

46.12 

27.87 

DARN 

13.19 

11.91 

81.04 

9.55 

274.68 

28.17 

46.10 

28.83 

RWS-NADE 

13.16 

11.68 

84.26 

9.71 

271.11 

26.43 

46.09 

27.92 

non AR 









RBM 

16.26 

22.66 

96.74 

15.15 

277.37 

43.05 

48.88 

29.38 

RWS-SBN 

13.65 

12.68 

90.63 

9.90 

272.54 

29.99 

46.16 

28.18 

BiHM 









-logp(x) 

13.78 

12.43 

86.49 

9.40 

272.66 

27.10 

46.12 

28.14 

-logp*(x) 

13.82 

12.31 

86.92 

9.40 

272.71 

27.30 

46.98 

28.22 

-2 log Z 

0.20 

0.27 

0.56 

0.09 

1.97 

1.87 

0.41 

0.54 

ess 

81.5% 

89.1% 

11.2% 

92.5% 

16.8% 

22.5% 

70.6% 

55.9% 


Table 1. Negative log-likelihood (NLL) on various binary datasets from the UCI repository: The top rows quote results from shallow 
models with autoregressive weights between their units within one layer. The second block shows results from non-autoregressive models 
(quoted from Bornschein & Bengio (2015)). In the third block we show the results obtained by training a BiHMs. We report the estimated 
test set NLL when evaluating just the top-down model, logp(x), and when evaluating logp* (x). Our BiHM models consistently obtain 
similar or better results than RWS while they prefer deeper architectures. All effective sample size (see 4.1) estimates are with error 
bounds smaller than ±0.2%. 


these estimates and their dependency on the number of 
samples from the proposal distribution g(h|x). Note that 
even though this model is relatively deep, it is not partic¬ 
ularly large, with about 700, 000 parameters in total. The 
DBMs in (Salakhutdinov & Hinton, 2009) contain about 
900, 000 and 1.1 million parameters; a variational autoen¬ 
coder with two deterministic, 500 units wide encoder and 
decoder layers, and with 100 top level latent units contains 
more than 1.4 million parameters. 

To highlight the models ability to generate crisp (non- 
blurry) digits we draw approximate samples from p*(x ) 
which are visualized in Fig. 1 B. Fig. 1 A shows sam¬ 
ples obtained when drawing from the top-down generative 
model p(x) before running any Gibbs iterations. Fig. 2 vi¬ 
sualizes the results when running the inpainting algorihm 
to reconstruct partially occluded images. Our sourcecode 
package contains additional results and animations. 


Model 

< - log p(x) 

~ - logp(x) 

autoregressive models 



NADE 

— 

88.9 

DARN (1 latent layer) 

88.3 

84.1 

continuous latent variables 



VAE 

96.2 

88.7 

VAE + HMC (8 iterations) 

88.3 

85.5 

IWAE (2 latent layers) 

96.1 

85.3 

binary latent variables 



NVIL (2 latent layers) 

99.6 

— 

RWS (5 latent layers) 

96.3 

85.4 

BiHM (12 latent layers) 

89.2 

84.3 


Table 3. Comparison of BiHMs to other recent methods in the 
literature. We report the lower bounds and estimates for the 
marginal log probability on the binarized MNIST test set. 

Toronto Face Database We also trained models on the 
98,058 examples from the unlabeled section of the Toronto 
face database (TFD, Susskind et al. (2010)). Each train¬ 
ing example is of size 48 x 48 pixels and we interpret 
the gray-level as Bernoulli probability for the bottom layer. 
We observe that training proceeds rapidly during the first 





















































































































few epochs but mostly only learns the mean-face. Dur¬ 
ing the next few hundred epochs training proceeds much 
slower but the estimated log-likelihood log p* (x) increases 
steadily. Fig. 5 A shows random samples from a model 
with respectively 1000,700,700,300 latent variables in the 4 
hidden layers. It was trained with a learning rate of 3 • 10“ 5 ; 
all other hyperparameters were set to the same values as 
before. Fig. 5 B shows the results from inpainting experi¬ 
ments with this model. 

4.1. Analysis of importance sampling-based estimates 

Estimating the partition function In Fig. 3 A we plot 

2 log Z estimates (equation (7)) over the number of outer 
samples AT 0 uter for our best MNIST model and for 3 dif¬ 
ferent choices of Thinner, he.dinner £ {1,10,100}. In Fig. 

3 B we plot the estimates over the total number of samples 
Pouter-dinner- We observe that choosing Thinner =1 and using 
only about 10 million samples results in high quality esti¬ 
mates for 2 log Z with an standard error far below 0.1 nats. 
Estimating based on 10 million samples takes less than 2 
minutes on a GTX980 GPU. Fig. 3 C shows the develop¬ 
ment of the 2 log Z estimate during learning and in relation 
to the LL estimates. 

Importance sampling efficiency A widely used metric 
to estimate the quality of an importance sampling estima¬ 
tor is the effective sampling size (ESS), given by ess = 
(Ef=i w fc) 2 /(Ef=i Wfc) (see, e.g., Robert & Casella (2009)). 
Larger values indicate more efficient sampling (more in¬ 
formation extracted per sample). We compute the ESS 
over the MNIST test set for AT=100,000 proposal sam¬ 
ples from g(h|x). For our best RWS model, a model 
with 5 stochastic layers (400,300,200,100,10), we obtain 
ess ~ 0.10% ± 0.06; for the BiHM model we obtain 
ess ~ 11.9% ±1.1. When we estimate the ESS for us¬ 
ing g(h|x) from the BiHM as a proposal distribution for 
p(h|x), we obtain ess=1.2% ± 0.2. The estimated ESS 
values indicate that training BiHM models indeed results 
in distributions whose intractable posterior p*(h|x) as well 
as top-down model p(h|x) are much better modeled by the 
learned g(h|x). We also estimated the ESS for a VAE with 
two determninistic, 500 units wide ReLU layers in the en¬ 
coder and decoder. This model has a single stochastic layer 
with 100 continuous variables at the top; it reaches a final 
estimated test set LL of logp(x) ~ —88.9 ± 0.28. The 
final variational lower bound, which corresponds exactly 
to the importance sampling estimate of logp(x) with K= 1 
sample, is —95.8. For this model we obtain an ESS of 
0.07% ± 0.02. These results indicate that we need thou¬ 
sands of samples to obtain reliable LL estimates with low 
approximation error. In Fig. 1 C we plot the estimated 
test set LL over the number of samples K used to esti¬ 
mate logp*(x) and logp(x). For all the models and for 


small a number of samples K we significantly underesti¬ 
mate the LL; but, in comparison to RWS, the estimates for 
the BiHM model are much higher and less sensitive to K. 
E.g, using K= 10 samples to evaluate the BiHM model re¬ 
sults in a higher LL estimate than using K- 10,000 samples 
to evaluate the RWS model. 

Computational cost To demonstrate the computational 
efficiency of our approach we show typical MNIST learn¬ 
ing curves in Fig. 4. For BiHM, RWS and VAE the learning 
rate was chosen within a factor of 2 to obtain optimal re¬ 
sults after 10 6 update steps (5 • 10 -4 for BiHM and RWS, 
3 • 1(T 3 for VAE; K= 10 for BiHM and RWS). For the 
IWAE experiment we use the original code, hyperparam¬ 
eters and learning rate schedule from (Burda et al., 2015): 
This experiment thus uses a mini-batch size of 20 instead 
of 100, K =5 training samples and and 8 different learning 
rates over the course of « 3300 epochs. In all cases we used 
K -1000 samples to evaluate the test set log-likelihoods. 
We generally observe that BiHM show very good conver¬ 
gence in terms of progress per update step and competi¬ 
tive performance in terms of total training time. Note that 
BiHMs and RWS allow for an efficient distributed imple¬ 
mentation in the future: per sample, only the binary acti¬ 
vations and a single floating point number (the importance 
weight) need to be communicated between layers. VAEs 
and IWAEs need to communicate continuous activations 
during the forward pass and continuous partial gradients 
during the backward pass. At test time BiHMs are typi¬ 
cally much more effective than the other methods: BiHMs 
obtain good LL estimates with K -10 or 100 samples per 
datapoint while VAE, RWS and IWAE models require « 
10,000 samples to obtain competitive results (compare Fig. 
1 C). 

5. Conclusion and future work 

We introduced a new scheme to construct probabilistic gen¬ 
erative models which are automatically regularized to be 
close to approximate inference distributions we have at our 
disposal. Using the Bhattacharyya distance we derived a 
lower-bound on the log-likelihood, and we demonstrated 
that the bound can be used to fit deep generative models 
with many layers of latent variables to complex training 
distributions. 

Compared to RWS, BiHM models typically prefer many 
more latent layers. After training a BiHM, the directed 
top-down model p shows better performance than a RWS 
trained model; both in terms of log-likelihood and sample 
quality. Sample quality can be further improved by approx¬ 
imately sampling from the full undirected BiHM model p*. 
The high similarity between p* and q , enforced by the train¬ 
ing objective, allows BiHMs to be evaluated orders of mag- 




Figure 4. Learning curves for MNIST experiments: For BiHM, RWS and VAE we chose the learning rate such that we get optimal 
results after 10 6 update steps; for IWAE we use the original learning rate schedule published in (Burda et al., 2015). BiHM and RWS 
use K -10 samples per datapoint; IWAE uses K =5 and a batch size of 20 (according to the original publication). We generally observe 
that BiHM show very good convergence in terms of progress per update step and in terms of total training time. 



Figure 5. Results after training on TFD: A) Random selection of 12 samples drawn from p*(x) (10 iterations of Gibbs sampling). B) 
The left column in each block shows the input; the right column shows a random output sample generated by the inpaiting algorithm 
(see section 3). 


nitude more efficiently than RWS, VAE and IWAE models. 

Possible directions for future research could involve 
semisupervised learning: the symmetric nature of the gen¬ 
erative model p* (it is always close to the bottom-up and 
top-down directed models q and p ) might make it par¬ 
ticularly interesting for learning tasks that require infer¬ 
ence given changing sets of observed and hidden variables. 
We also have a wide range of potential choices for our 
parametrized conditional distributions. Assuming continu¬ 
ous latent variables for example and eventually choosing an 
alternative inference method might make p* a better suited 
model for some training distributions. 
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